EU_Social_Progress_Index_2024

Source: European Commission https://commission.europa.eu/index_en

https://composite-indicators.jrc.ec.europa.eu/explorer

https://composite-indicators.jrc.ec.europa.eu/explorer/explorer/indices/eu-r-spi/eu-regional-social-progress-index

In [1]:
!pip install mapclassify
!pip install geopandas
In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import statsmodels.api as sm
import matplotlib.pyplot as plt
import requests
import datetime
import geopandas as gpd


import warnings
warnings.filterwarnings("ignore")

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

Loading the dataset

In [3]:
path = 'Datasets/EU-SPI 2.0_2024_raw_data.xlsx'

df = pd.read_excel(path)
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 242 entries, 0 to 241
Data columns (total 48 columns):
 #   Column                                                              Non-Null Count  Dtype  
---  ------                                                              --------------  -----  
 0   Country                                                             242 non-null    object 
 1   NUTS code                                                           242 non-null    object 
 2   RegionName                                                          242 non-null    object 
 3   Infant mortality                                                    242 non-null    float64
 4   Satisfaction with water quality                                     236 non-null    float64
 5   Uncollected sewage                                                  242 non-null    float64
 6   Sewage treatment, additional                                        242 non-null    float64
 7   Safety at night                                                     236 non-null    float64
 8   Money Stolen                                                        236 non-null    float64
 9   Assaulted/Mugged                                                    236 non-null    float64
 10  Traffic deaths                                                      242 non-null    int64  
 11  Share of low-achieving 15 year olds in reading (level 1a or lower)  223 non-null    float64
 12  Share of low-achieving 15 year olds in maths and science            242 non-null    float64
 13  Lower-secondary completion only                                     241 non-null    float64
 14  Early school leavers                                                237 non-null    float64
 15  Broadband at home                                                   241 non-null    float64
 16  Digital skills above basic level                                    236 non-null    float64
 17  Online interaction with public authorities                          241 non-null    float64
 18  Internet access                                                     236 non-null    float64
 19  Freedom of media                                                    236 non-null    float64
 20  Subjective health status                                            236 non-null    float64
 21  Standardised cancer death rate                                      242 non-null    float64
 22  Standardised heart disease death rate                               242 non-null    float64
 23  Years of life lost due to air pollution                             234 non-null    float64
 24  Index of positive emotions                                          236 non-null    float64
 25  Air pollution NO2                                                   234 non-null    float64
 26  Air pollution Ozone (SOMO35)                                        234 non-null    float64
 27  Air pollution pm2.5                                                 234 non-null    float64
 28  Bathing water quality                                               216 non-null    float64
 29  Trust in the national government                                    236 non-null    float64
 30  Trust in the legal system                                           236 non-null    float64
 31  Trust in the police                                                 236 non-null    float64
 32  Voiced opinion to public official                                   236 non-null    float64
 33  Female participation in regional assemblies                         242 non-null    float64
 34  Institution quality index                                           240 non-null    float64
 35  Freedom over life choices                                           236 non-null    float64
 36  Job opportunities                                                   236 non-null    float64
 37  Teenage pregnancy                                                   242 non-null    float64
 38  Young people not in education, employment or training (NEET)        238 non-null    float64
 39  Institutions corruption index (EQI)                                 240 non-null    float64
 40  Institution impartiality index (EQI)                                240 non-null    float64
 41  Tolerance towards immigrants                                        236 non-null    float64
 42  Tolerance towards minorities                                        236 non-null    float64
 43  Tolerance towards  gay or lesbian people                            236 non-null    float64
 44  Women treated with respect                                          236 non-null    float64
 45  Tertiary education attainment                                       242 non-null    float64
 46  Lifelong learning                                                   241 non-null    float64
 47  Academic citations per 1000 persons                                 238 non-null    float64
dtypes: float64(44), int64(1), object(3)
memory usage: 90.9+ KB

Loading the spatial data

Source: Eurostat

https://ec.europa.eu/eurostat/web/gisco/geodata/statistical-units/territorial-units-statistics

In [6]:
# Load the NUTS region shapefile
shapefile_path = 'NUTS_RG_20M_2021_3035.shp'
gdf = gpd.read_file(shapefile_path)
In [7]:
gdf.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 2010 entries, 0 to 2009
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   NUTS_ID     2010 non-null   object  
 1   LEVL_CODE   2010 non-null   int32   
 2   CNTR_CODE   2010 non-null   object  
 3   NAME_LATN   2010 non-null   object  
 4   NUTS_NAME   2010 non-null   object  
 5   MOUNT_TYPE  2009 non-null   float64 
 6   URBN_TYPE   2010 non-null   int32   
 7   COAST_TYPE  2010 non-null   int32   
 8   FID         2010 non-null   object  
 9   geometry    2010 non-null   geometry
dtypes: float64(1), geometry(1), int32(3), object(5)
memory usage: 133.6+ KB
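
The shapefile has a `LEVL_CODE` column and far more rows (2,010) than the indicator table (242), so the two have to be matched on the region code before any mapping. The sketch below shows one possible way to do that. It uses plain pandas frames with invented values in place of the real files, and it assumes that `NUTS code` in the indicator table corresponds to `NUTS_ID` at `LEVL_CODE == 2`; that assumption should be checked against the actual data.

```python
import pandas as pd

# Toy stand-in for the shapefile attributes (invented values, geometry omitted)
toy_gdf = pd.DataFrame({
    "NUTS_ID": ["ES", "ES3", "ES30", "ES300", "PT30"],
    "LEVL_CODE": [0, 1, 2, 3, 2],
})

# Toy stand-in for the indicator table (invented values)
toy_df = pd.DataFrame({
    "NUTS code": ["ES30", "PT30", "XX99"],
    "Infant mortality": [2.5, 3.4, 1.0],
})

# Keep only the NUTS 2 rows, then left-join the indicators onto them
nuts2 = toy_gdf[toy_gdf["LEVL_CODE"] == 2]
joined = nuts2.merge(toy_df, how="left", left_on="NUTS_ID", right_on="NUTS code")

# Codes present in the indicator table but absent from the NUTS 2 geometries
unmatched = sorted(set(toy_df["NUTS code"]) - set(nuts2["NUTS_ID"]))
print(joined)
print("Unmatched codes:", unmatched)
```

Listing the unmatched codes is a cheap way to spot regions whose identifiers differ between the two sources before they silently drop out of a map.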

GDP

  • Gross domestic product per capita at current market prices by NUTS 2 regions [nama_10r_2gdp]

https://ec.europa.eu/eurostat/databrowser/view/nama_10r_2gdp/default/table?lang=en

Unit of measure [UNIT]: Euro per inhabitant [EUR_HAB]

In [8]:
path = 'Datasets/gdp_pp.xlsx'
gdp_pp = pd.read_excel(path)

gdp_pp.head()
Out[8]:
GEO (Codes) GEO (Labels) 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022
0 EU27_2020 European Union - 27 countries (from 2020) 18400 19200 19900 20300 21200 22000 23200 24600 25300 24100 24900 25700 25800 26000 26600 27500 28200 29300 30300 31300 30100 32700 35400
1 BE Belgium 25000 25700 26400 27100 28500 29600 30800 32300 32800 32100 33300 34100 34800 35200 36000 37000 38000 39100 40300 41700 39900 43800 47400
2 BE1 Région de Bruxelles-Capitale/Brussels Hoofdste... : : : 54900 57400 58700 60600 62200 61800 60900 62500 62200 63300 63400 64500 66100 66700 68400 69600 71700 68300 73400 77800
3 BE10 Région de Bruxelles-Capitale/Brussels Hoofdste... : : : 54900 57400 58700 60600 62200 61800 60900 62500 62200 63300 63400 64500 66100 66700 68400 69600 71700 68300 73400 77800
4 BE2 Vlaams Gewest : : : 26800 28100 29300 30700 32400 32900 32000 33200 34100 34900 35400 36200 37400 38600 39800 40900 42200 40600 45200 49000
In [9]:
gdp_pp.replace(':', np.nan, inplace=True)
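
Eurostat tables use `:` to mark unavailable values. After that marker is replaced, a column that contained it may still be stored as `object`, so an explicit numeric conversion can help. A minimal sketch on invented data, assuming the year columns are labelled with strings such as `'2021'`:

```python
import pandas as pd

# Invented sample mimicking the layout of the GDP table
toy_gdp = pd.DataFrame({
    "GEO (Codes)": ["BE1", "BE2"],
    "2021": [":", 45200],
    "2022": [77800, 49000],
})

year_cols = ["2021", "2022"]

# errors="coerce" turns anything non-numeric (such as ':') into NaN
toy_gdp[year_cols] = toy_gdp[year_cols].apply(pd.to_numeric, errors="coerce")

print(toy_gdp.dtypes)
print(toy_gdp)
```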
In [10]:
gdp_pp_data = gdp_pp[['GEO (Codes)','2022']]

gdp_pp_data = gdp_pp_data.copy()
gdp_pp_data.rename(columns={'2022': 'gdp_per_capita_2022'}, inplace=True)

df = pd.merge(df, gdp_pp_data, how='left', left_on='NUTS code', right_on='GEO (Codes)')

df = df.drop(columns = 'GEO (Codes)')
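
A left merge leaves `NaN` wherever a region code has no match in the GDP table, without any warning. One possible way to make such cases visible is sketched below with the `indicator` and `validate` options of `pd.merge`. The frames and values are invented for illustration.

```python
import pandas as pd

# Invented region codes and GDP values
toy_df = pd.DataFrame({"NUTS code": ["AT11", "AT12", "ZZ00"]})
toy_gdp = pd.DataFrame({
    "GEO (Codes)": ["AT11", "AT12"],
    "gdp_per_capita_2022": [34000.0, 39000.0],
})

merged = pd.merge(
    toy_df,
    toy_gdp,
    how="left",
    left_on="NUTS code",
    right_on="GEO (Codes)",
    indicator=True,         # adds a '_merge' column: 'both' or 'left_only'
    validate="one_to_one",  # raises MergeError if a key is duplicated on either side
)

# Region codes that found no GDP value
missing_gdp = merged.loc[merged["_merge"] == "left_only", "NUTS code"].tolist()
print(missing_gdp)

# Drop the helper columns once the check is done
merged = merged.drop(columns=["GEO (Codes)", "_merge"])
```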

Exploratory analysis

The dataset contains:

  • 242 rows, one per EU region
  • 49 columns: 3 hold geographic information (country, region name and NUTS code, the standard identifier), 1 is GDP per capita, and the remaining 45 are indicators
In [11]:
df.shape
Out[11]:
(242, 49)

The data types appear to be correct.

In [12]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 242 entries, 0 to 241
Data columns (total 49 columns):
 #   Column                                                              Non-Null Count  Dtype  
---  ------                                                              --------------  -----  
 0   Country                                                             242 non-null    object 
 1   NUTS code                                                           242 non-null    object 
 2   RegionName                                                          242 non-null    object 
 3   Infant mortality                                                    242 non-null    float64
 4   Satisfaction with water quality                                     236 non-null    float64
 5   Uncollected sewage                                                  242 non-null    float64
 6   Sewage treatment, additional                                        242 non-null    float64
 7   Safety at night                                                     236 non-null    float64
 8   Money Stolen                                                        236 non-null    float64
 9   Assaulted/Mugged                                                    236 non-null    float64
 10  Traffic deaths                                                      242 non-null    int64  
 11  Share of low-achieving 15 year olds in reading (level 1a or lower)  223 non-null    float64
 12  Share of low-achieving 15 year olds in maths and science            242 non-null    float64
 13  Lower-secondary completion only                                     241 non-null    float64
 14  Early school leavers                                                237 non-null    float64
 15  Broadband at home                                                   241 non-null    float64
 16  Digital skills above basic level                                    236 non-null    float64
 17  Online interaction with public authorities                          241 non-null    float64
 18  Internet access                                                     236 non-null    float64
 19  Freedom of media                                                    236 non-null    float64
 20  Subjective health status                                            236 non-null    float64
 21  Standardised cancer death rate                                      242 non-null    float64
 22  Standardised heart disease death rate                               242 non-null    float64
 23  Years of life lost due to air pollution                             234 non-null    float64
 24  Index of positive emotions                                          236 non-null    float64
 25  Air pollution NO2                                                   234 non-null    float64
 26  Air pollution Ozone (SOMO35)                                        234 non-null    float64
 27  Air pollution pm2.5                                                 234 non-null    float64
 28  Bathing water quality                                               216 non-null    float64
 29  Trust in the national government                                    236 non-null    float64
 30  Trust in the legal system                                           236 non-null    float64
 31  Trust in the police                                                 236 non-null    float64
 32  Voiced opinion to public official                                   236 non-null    float64
 33  Female participation in regional assemblies                         242 non-null    float64
 34  Institution quality index                                           240 non-null    float64
 35  Freedom over life choices                                           236 non-null    float64
 36  Job opportunities                                                   236 non-null    float64
 37  Teenage pregnancy                                                   242 non-null    float64
 38  Young people not in education, employment or training (NEET)        238 non-null    float64
 39  Institutions corruption index (EQI)                                 240 non-null    float64
 40  Institution impartiality index (EQI)                                240 non-null    float64
 41  Tolerance towards immigrants                                        236 non-null    float64
 42  Tolerance towards minorities                                        236 non-null    float64
 43  Tolerance towards  gay or lesbian people                            236 non-null    float64
 44  Women treated with respect                                          236 non-null    float64
 45  Tertiary education attainment                                       242 non-null    float64
 46  Lifelong learning                                                   241 non-null    float64
 47  Academic citations per 1000 persons                                 238 non-null    float64
 48  gdp_per_capita_2022                                                 242 non-null    float64
dtypes: float64(45), int64(1), object(3)
memory usage: 92.8+ KB

Missing data

The categorical variables have no missing values.

Some indicators have missing values.

Overall the share is small, at most 3.31% per variable, with two exceptions:

  • Bathing water quality: 10.74%
  • Share of low-achieving 15 year olds in reading (level 1a or lower): 7.85%

[Figure 00.1]

In [13]:
# Count missing values per column and express them as a percentage of rows
nan_count = df.isnull().sum()

nan_percentage = round((nan_count / len(df)) * 100, 2)

nan_summary = pd.DataFrame({
    'indicador': nan_count.index,
    'missing': nan_count.values,
    'proporcion': nan_percentage.values
})

# Keep only the columns that have at least one missing value
nan_summary = nan_summary[nan_summary['missing'] > 0]
print("Figure 00.1")
print(nan_summary)
Figure 00.1
                                            indicador  missing  proporcion
4                     Satisfaction with water quality        6        2.48
7                                     Safety at night        6        2.48
8                                        Money Stolen        6        2.48
9                                    Assaulted/Mugged        6        2.48
11  Share of low-achieving 15 year olds in reading...       19        7.85
13                    Lower-secondary completion only        1        0.41
14                               Early school leavers        5        2.07
15                                  Broadband at home        1        0.41
16                   Digital skills above basic level        6        2.48
17         Online interaction with public authorities        1        0.41
18                                    Internet access        6        2.48
19                                   Freedom of media        6        2.48
20                           Subjective health status        6        2.48
23            Years of life lost due to air pollution        8        3.31
24                         Index of positive emotions        6        2.48
25                                  Air pollution NO2        8        3.31
26                       Air pollution Ozone (SOMO35)        8        3.31
27                                Air pollution pm2.5        8        3.31
28                              Bathing water quality       26       10.74
29                   Trust in the national government        6        2.48
30                          Trust in the legal system        6        2.48
31                                Trust in the police        6        2.48
32                  Voiced opinion to public official        6        2.48
34                          Institution quality index        2        0.83
35                          Freedom over life choices        6        2.48
36                                  Job opportunities        6        2.48
38  Young people not in education, employment or t...        4        1.65
39                Institutions corruption index (EQI)        2        0.83
40               Institution impartiality index (EQI)        2        0.83
41                       Tolerance towards immigrants        6        2.48
42                       Tolerance towards minorities        6        2.48
43           Tolerance towards  gay or lesbian people        6        2.48
44                         Women treated with respect        6        2.48
46                                  Lifelong learning        1        0.41
47               Academic citations per 1000 persons         4        1.65
In [14]:
filtered_df = df[['Country', 'Bathing water quality', 'Share of low-achieving 15 year olds in reading (level 1a or lower)']]

grouped_df = filtered_df.groupby('Country')

nan_summary_list = []

for name, group in grouped_df:
    nan_count = group.isnull().sum()
    nan_percentage = round((nan_count / len(group)) * 100, 2)

    nan_summary = pd.DataFrame({
        'Country': name,
        'indicador': nan_count.index,
        'missing': nan_count.values,
        'proporcion': nan_percentage.values
    })

    nan_summary_list.append(nan_summary)

nan_summary_final = pd.concat(nan_summary_list)
nan_summary_final = nan_summary_final[nan_summary_final['missing'] > 0]

nan_summary_final
Out[14]:
   Country  indicador                                          missing  proporcion
1  BE       Bathing water quality                                    4       36.36
1  BG       Bathing water quality                                    4       66.67
1  CZ       Bathing water quality                                    1       12.50
1  DE       Bathing water quality                                    2        5.26
1  EL       Bathing water quality                                    1        7.69
1  ES       Bathing water quality                                    1        5.26
2  ES       Share of low-achieving 15 year olds in reading...       19      100.00
1  HR       Bathing water quality                                    1       25.00
1  HU       Bathing water quality                                    1       12.50
1  IT       Bathing water quality                                    1        4.76
1  RO       Bathing water quality                                    7       87.50
1  SE       Bathing water quality                                    1       12.50
1  SK       Bathing water quality                                    2       50.00

The missing values of "Bathing water quality" are spread across several countries, whereas all 19 missing values of "Share of low-achieving 15 year olds in reading..." are in Spain.
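
The per-country breakdown can also be computed without an explicit loop. In the sketch below, the boolean mask from `isna()` is grouped by country and averaged, which gives the share of missing values per country and indicator. The data are invented and `Reading` is a hypothetical short column name.

```python
import numpy as np
import pandas as pd

# Invented data: two countries, two indicators
toy = pd.DataFrame({
    "Country": ["ES", "ES", "RO", "RO"],
    "Bathing water quality": [0.9, np.nan, np.nan, np.nan],
    "Reading": [np.nan, np.nan, 20.0, 25.0],
})

indicators = ["Bathing water quality", "Reading"]

# Mean of a boolean mask = fraction of missing values; times 100 for a percentage
missing_pct = (
    toy[indicators].isna()
    .groupby(toy["Country"])
    .mean()
    .mul(100)
    .round(2)
)
print(missing_pct)
```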

We confirm that the indicator "Share of low-achieving 15 year olds in reading (level 1a or lower)" has no data for Spain:

In [15]:
df[df['Country']=="ES"]['Share of low-achieving 15 year olds in reading (level 1a or lower)']
Out[15]:
92    NaN
93    NaN
94    NaN
95    NaN
96    NaN
97    NaN
98    NaN
99    NaN
100   NaN
101   NaN
102   NaN
103   NaN
104   NaN
105   NaN
106   NaN
107   NaN
108   NaN
109   NaN
110   NaN
Name: Share of low-achieving 15 year olds in reading (level 1a or lower), dtype: float64

We check whether any other indicator is entirely missing for some country:

[Figure 00.2]

In [16]:
missing_all_by_country = {}

# Get the list of unique countries
paises = df['Country'].unique()

# Iterate over each country
for pais in paises:
    # Filter the DataFrame to the current country
    df_pais = df[df['Country'] == pais]

    # Find the variables whose values are all missing for this country
    variables_con_todos_perdidos = df_pais.columns[df_pais.isnull().all()].tolist()

    # Store the result in the dictionary
    missing_all_by_country[pais] = variables_con_todos_perdidos

print("Figure 00.2")
print()
# Show the result
for pais, variables in missing_all_by_country.items():
    print(f"Country: {pais}, variables with all values missing: {variables}")
Figure 00.2

Country: AT, variables with all values missing: []
Country: BE, variables with all values missing: []
Country: BG, variables with all values missing: []
Country: CY, variables with all values missing: ['Digital skills above basic level']
Country: CZ, variables with all values missing: []
Country: DE, variables with all values missing: []
Country: DK, variables with all values missing: []
Country: EE, variables with all values missing: ['Digital skills above basic level']
Country: EL, variables with all values missing: []
Country: ES, variables with all values missing: ['Share of low-achieving 15 year olds in reading (level 1a or lower)']
Country: FI, variables with all values missing: []
Country: FR, variables with all values missing: []
Country: HR, variables with all values missing: []
Country: HU, variables with all values missing: []
Country: IE, variables with all values missing: []
Country: IT, variables with all values missing: []
Country: LT, variables with all values missing: []
Country: LU, variables with all values missing: ['Digital skills above basic level']
Country: LV, variables with all values missing: ['Digital skills above basic level']
Country: MT, variables with all values missing: ['Digital skills above basic level']
Country: NL, variables with all values missing: []
Country: PL, variables with all values missing: []
Country: PT, variables with all values missing: []
Country: RO, variables with all values missing: []
Country: SE, variables with all values missing: []
Country: SI, variables with all values missing: []
Country: SK, variables with all values missing: []
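
The same check can be expressed as a grouped `all()` over the missing-value mask. The sketch below uses invented data with hypothetical column names `A` and `B`; it is an alternative formulation, not the code that produced the output above.

```python
import numpy as np
import pandas as pd

# Invented data: CY has a single region, ES and FR have two each
toy = pd.DataFrame({
    "Country": ["CY", "ES", "ES", "FR", "FR"],
    "A": [np.nan, 1.0, 2.0, 3.0, np.nan],
    "B": [5.0, np.nan, np.nan, 6.0, 7.0],
})

# True where every value of an indicator is missing within a country
all_missing = toy.drop(columns="Country").isna().groupby(toy["Country"]).all()

# Turn each row of the boolean table into a list of column names
missing_all_by_country = {
    country: row[row].index.tolist() for country, row in all_missing.iterrows()
}
print(missing_all_by_country)
```

For a country with a single region, such as `CY` in the toy data, one missing value is enough for an indicator to count as entirely missing.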

Descriptive statistics

[Figure 00.3]

In [17]:
round(df.describe(),2)
Out[17]:
Infant mortality Satisfaction with water quality Uncollected sewage Sewage treatment, additional Safety at night Money Stolen Assaulted/Mugged Traffic deaths Share of low-achieving 15 year olds in reading (level 1a or lower) Share of low-achieving 15 year olds in maths and science Lower-secondary completion only Early school leavers Broadband at home Digital skills above basic level Online interaction with public authorities Internet access Freedom of media Subjective health status Standardised cancer death rate Standardised heart disease death rate Years of life lost due to air pollution Index of positive emotions Air pollution NO2 Air pollution Ozone (SOMO35) Air pollution pm2.5 Bathing water quality Trust in the national government Trust in the legal system Trust in the police Voiced opinion to public official Female participation in regional assemblies Institution quality index Freedom over life choices Job opportunities Teenage pregnancy Young people not in education, employment or training (NEET) Institutions corruption index (EQI) Institution impartiality index (EQI) Tolerance towards immigrants Tolerance towards minorities Tolerance towards gay or lesbian people Women treated with respect Tertiary education attainment Lifelong learning Academic citations per 1000 persons gdp_per_capita_2022
count 242.00 236.00 242.00 242.00 236.00 236.00 236.00 242.00 223.00 242.00 241.00 237.00 241.00 236.00 241.00 236.00 236.00 236.00 242.00 242.00 234.00 236.00 234.00 234.00 234.00 216.00 236.00 236.00 236.00 236.00 242.00 240.00 236.00 236.00 242.00 238.00 240.00 240.00 236.00 236.00 236.00 236.00 242.00 241.00 238.00 242.00
mean 3.21 82.34 2.40 84.30 75.53 7.98 3.12 48.38 22.91 45.24 19.70 9.56 89.49 26.74 60.05 89.46 72.18 68.13 78.71 47.12 552.96 71.68 12.92 4182.70 10.90 0.78 44.27 56.53 79.14 23.58 33.09 0.10 80.04 48.59 8.19 10.33 0.13 0.11 71.46 75.72 64.96 71.18 33.27 12.09 3.76 33784.71
std 1.45 9.38 7.91 23.34 7.63 2.75 1.82 24.06 6.98 9.01 11.43 4.45 5.01 10.00 20.13 7.21 16.62 7.44 18.71 32.13 404.43 4.67 4.39 1418.50 3.58 0.22 14.61 16.04 8.51 7.36 12.07 0.98 10.01 17.32 10.73 4.94 1.00 0.98 14.00 10.89 20.63 11.39 10.32 7.60 3.48 17842.51
min 0.00 43.41 0.00 0.00 55.53 1.84 0.57 0.00 11.05 31.01 1.50 1.43 73.50 6.88 10.46 62.33 28.85 25.22 46.32 18.02 0.00 56.10 0.30 1524.81 3.64 0.05 6.47 9.93 46.14 9.29 4.00 -2.70 48.08 11.30 1.04 2.73 -2.56 -2.45 24.85 44.58 9.17 34.53 13.70 0.90 0.00 8500.00
25% 2.30 76.93 0.00 79.33 71.18 6.12 1.55 32.00 20.69 40.67 12.30 6.33 86.72 20.33 45.86 87.34 64.42 64.31 66.58 26.10 267.95 68.01 10.51 3129.04 8.59 0.70 35.07 44.73 76.81 17.71 23.85 -0.70 75.26 35.17 3.22 6.99 -0.67 -0.72 64.78 69.47 52.16 62.81 25.88 7.20 1.25 20375.00
50% 3.00 83.31 0.00 95.71 75.89 7.83 2.82 43.00 20.94 42.36 17.50 8.60 89.69 23.37 61.84 90.74 75.18 68.55 74.85 34.30 405.40 72.81 12.59 3990.71 10.00 0.85 43.12 55.56 81.41 23.21 31.93 0.26 82.85 48.24 5.71 9.23 -0.03 0.22 74.07 78.94 70.35 73.08 32.30 10.60 2.80 32700.00
75% 3.70 88.85 0.00 99.46 80.92 9.86 4.22 60.00 23.85 49.14 25.70 12.10 92.49 32.76 76.56 94.39 84.57 73.18 86.49 54.03 781.77 74.66 15.50 4922.12 12.98 0.93 55.98 70.66 84.14 29.25 45.02 0.88 87.19 61.59 7.34 12.79 1.03 0.93 80.92 83.22 79.48 79.80 40.65 14.50 5.37 44125.00
max 9.70 99.91 55.56 100.00 92.74 15.13 8.14 159.00 47.10 71.05 58.70 26.03 100.00 52.53 94.35 99.73 98.09 81.20 147.60 179.74 1871.59 81.61 31.18 8620.22 23.24 1.00 84.88 91.16 92.30 46.92 53.66 2.49 96.67 86.56 71.78 28.63 2.14 2.81 96.28 90.67 94.99 93.38 62.10 38.10 21.27 120300.00

Outliers

Outliers with Z-score

The Z-score measures how many standard deviations a value lies above or below the mean of the distribution:

Z = (X - μ) / σ

A Z-score above 3 or below -3 is generally considered an outlier.
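
As a worked illustration of the formula, the sketch below computes Z-scores column by column with pandas on invented data. pandas skips `NaN` when computing the mean and the standard deviation, so a column with missing values still gets scores for its observed entries. Using `ddof=0` (population standard deviation) is an assumption made here to match the default of `scipy.stats.zscore`.

```python
import numpy as np
import pandas as pd

# Invented values: eleven typical observations, one extreme, one missing
toy = pd.DataFrame({
    "x": [10.0] * 11 + [100.0, np.nan],
})

# Z = (X - mean) / std, computed per column; NaN entries are ignored
z = (toy - toy.mean()) / toy.std(ddof=0)

# Flag rows where any column has |Z| > 3 (a NaN score compares as False)
is_outlier = (z.abs() > 3).any(axis=1)
print(toy[is_outlier])
```

With this approach the row holding the extreme value is flagged, while the row with the missing value is simply left unflagged.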

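[Figure 00.4]

In [18]:
from scipy import stats

num_cols = df.select_dtypes(include=['float64', 'int64']).columns

# Note: stats.zscore propagates NaN by default, so any column that contains
# a missing value yields all-NaN z-scores and cannot flag outliers here.
z_scores = np.abs(stats.zscore(df[num_cols]))
# Flag rows where at least one indicator has |z| > 3
outliers = (z_scores > 3).any(axis=1)
outliers_df = df[outliers]

print("Figure 00.4")
print()
print(outliers_df)
Figure 00.4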

    Country NUTS code                  RegionName  Infant mortality  \
20       BG      BG31               Severozapaden               7.1   
21       BG      BG32          Severen tsentralen               6.4   
22       BG      BG33              Severoiztochen               5.8   
23       BG      BG34                Yugoiztochen               9.6   
24       BG      BG41                 Yugozapaden               3.1   
25       BG      BG42           Yuzhen tsentralen               5.1   
73       DK      DK01                 Hovedstaden               2.9   
138      FR      FRY1                  Guadeloupe               9.7   
139      FR      FRY2                  Martinique               9.1   
140      FR      FRY3                      Guyane               8.0   
142      FR      FRY5                     Mayotte               8.8   
144      HR      HR03          Jadranska Hrvatska               4.2   
145      HR      HR05                 Grad Zagreb               2.2   
149      HU      HU21              Közép-Dunántúl               4.2   
151      HU      HU23                Dél-Dunántúl               2.9   
152      HU      HU31          Észak-Magyarország               4.7   
153      HU      HU32                Észak-Alföld               4.5   
154      HU      HU33                  Dél-Alföld               3.2   
156      IE      IE05                    Southern               3.7   
157      IE      IE06         Eastern and Midland               2.6   
160      IT      ITC3                     Liguria               2.8   
168      IT      ITG1                     Sicilia               3.4   
181      LU      LU00                  Luxembourg               3.1   
219      PT      PT30  Região Autónoma da Madeira               3.4   
220      RO      RO11                   Nord-Vest               5.8   
221      RO      RO12                      Centru               6.2   
222      RO      RO21                    Nord-Est               5.7   
223      RO      RO22                     Sud-Est               6.2   
224      RO      RO31                Sud-Muntenia               4.9   
226      RO      RO41            Sud-Vest Oltenia               6.1   
241      SK      SK04          Východné Slovensko               8.1   

     Satisfaction with water quality  Uncollected sewage  \
20                         66.030690           13.112778   
21                         65.559160            7.750597   
22                         53.167801            2.871719   
23                         56.738884            4.071538   
24                         72.361886            3.876427   
25                         62.183077           10.397535   
73                         86.230570            0.000000   
138                              NaN            0.000000   
139                              NaN            0.000000   
140                              NaN            0.000000   
142                              NaN           38.083379   
144                        83.390794           11.764233   
145                        69.245381            0.000000   
149                        83.906151            0.000000   
151                        70.721171            0.000000   
152                        75.840807            0.000000   
153                        80.710030            0.000000   
154                        75.607066            0.000000   
156                        86.026341            0.000000   
157                        83.535296            0.000000   
160                        77.921639            0.000000   
168                        80.354638            2.906338   
181                        84.259418            0.000000   
219                        93.903471            3.223928   
220                        66.868325           31.978751   
221                        71.426739           35.952902   
222                        78.529510           38.579372   
223                        77.661334           30.262179   
224                        71.285612           55.555370   
226                        73.554553           55.233976   
241                        83.981749            1.208562   

     Sewage treatment, additional  Safety at night  Money Stolen  \
20                      56.147543        64.362238      4.394227   
21                      71.623448        56.013020      4.154489   
22                      89.369157        60.331069      6.455514   
23                      83.293837        67.406096      9.026132   
24                      74.180576        61.447177      7.172962   
25                      40.105975        75.855519      5.189070   
73                     100.000000        81.072249      7.802786   
138                     78.519761              NaN           NaN   
139                     68.353602              NaN           NaN   
140                      0.000000              NaN           NaN   
142                      0.603897              NaN           NaN   
144                      0.200850        85.271090      4.635883   
145                      0.000000        75.959677     10.980450   
149                     82.287945        77.602417      6.654045   
151                     63.942447        80.948219      5.441167   
152                     68.019629        68.852846      4.486461   
153                     82.317694        66.717524     10.920720   
154                     42.585622        72.990451     10.481930   
156                     50.097924        74.600090      6.100404   
157                     90.430568        77.087155      6.932770   
160                     10.338217        64.716392      9.629818   
168                     10.513516        71.112526      9.334710   
181                     93.047115        76.832755      7.638356   
219                     10.111170        80.023656      4.292296   
220                     60.475349        71.532229      8.228294   
221                     52.602769        60.536476      2.512792   
222                     50.390541        61.596737      8.532092   
223                     60.029871        70.520433      4.682273   
224                     27.383876        67.770895      7.784635   
226                     29.270395        66.802958      5.503103   
241                     86.116320        70.760150      9.250217   

     Assaulted/Mugged  Traffic deaths  \
20           2.793928             133   
21           1.018728             103   
22           1.767896              66   
23           2.371754              80   
24           3.370132              75   
25           1.719668              61   
73           2.560730              12   
138               NaN             159   
139               NaN              81   
140               NaN             120   
142               NaN              42   
144          1.496991              92   
145          7.228171              31   
149          1.540083              75   
151          2.656142              73   
152          3.388809              56   
153          2.439287              64   
154          3.122959              69   
156          5.415271              26   
157          4.241856              23   
160          5.164736              39   
168          3.925364              33   
181          0.678235              38   
219          1.827482              47   
220          2.492410              83   
221          1.477218              76   
222          2.838239             102   
223          2.550227             116   
224          4.933595             109   
226          2.359531             112   
241          1.212579              36   

     Share of low-achieving 15 year olds in reading (level 1a or lower)  \
20                                           47.102802                    
21                                           47.102802                    
22                                           47.102802                    
23                                           47.102802                    
24                                           47.102802                    
25                                           47.102802                    
73                                           15.999534                    
138                                          20.937250                    
139                                          20.937250                    
140                                          20.937250                    
142                                          20.937250                    
144                                          21.578840                    
145                                          21.578840                    
149                                          25.273944                    
151                                          25.273944                    
152                                          25.273944                    
153                                          25.273944                    
154                                          25.273944                    
156                                          11.799349                    
157                                          11.799349                    
160                                          23.267173                    
168                                          23.267173                    
181                                          29.291415                    
219                                          20.221871                    
220                                          40.838355                    
221                                          40.838355                    
222                                          40.838355                    
223                                          40.838355                    
224                                          40.838355                    
226                                          40.838355                    
241                                          31.410025                    

     Share of low-achieving 15 year olds in maths and science  \
20                                           68.137605          
21                                           68.137605          
22                                           68.137605          
23                                           68.137605          
24                                           68.137605          
25                                           68.137605          
73                                           36.541697          
138                                          42.357320          
139                                          42.357320          
140                                          42.357320          
142                                          42.357320          
144                                          58.539807          
145                                          58.539807          
149                                          49.274133          
151                                          49.274133          
152                                          49.274133          
153                                          49.274133          
154                                          49.274133          
156                                          40.407878          
157                                          40.407878          
160                                          46.736349          
168                                          46.736349          
181                                          48.909676          
219                                          44.178392          
220                                          71.046676          
221                                          71.046676          
222                                          71.046676          
223                                          71.046676          
224                                          71.046676          
226                                          71.046676          
241                                          46.440243          

     Lower-secondary completion only  Early school leavers  Broadband at home  \
20                              18.0             17.633333              73.50   
21                              15.4              9.933333              82.67   
22                              21.7             13.266667              85.23   
23                              22.1             21.633333              81.73   
24                               7.8              6.333333              86.23   
25                              19.6             12.300000              85.53   
73                              13.8              8.300000              94.48   
138                             34.4             13.250000              75.37   
139                             29.1             13.350000              85.81   
140                             48.5             26.033333              79.16   
142                              NaN             14.300000              83.55   
144                              9.2              1.633333              87.38   
145                              5.3              2.700000              86.06   
149                             13.2             11.333333              92.10   
151                             17.5             15.266667              89.03   
152                             20.6             22.233333              86.26   
153                             19.2             16.333333              88.04   
154                             14.0             10.800000              86.72   
156                             12.8              4.700000              90.78   
157                             11.2              3.933333              95.22   
160                             30.4             11.300000              88.85   
168                             47.6             19.800000              83.43   
181                             18.4              8.566667              97.35   
219                             51.7                   NaN              87.14   
220                             17.2             15.700000              89.60   
221                             18.9             21.766667              89.58   
222                             23.9             16.100000              86.89   
223                             25.7             22.200000              84.29   
224                             20.0             15.266667              85.76   
226                             16.7             12.700000              85.90   
241                              9.6             12.433333              87.04   

     Digital skills above basic level  \
20                           6.881001   
21                           7.739488   
22                           7.979152   
23                           7.651486   
24                           8.072771   
25                           8.007238   
73                          38.182303   
138                         29.632703   
139                         29.632703   
140                         29.632703   
142                         29.632703   
144                         31.658243   
145                         30.201776   
149                         21.853206   
151                         21.124765   
152                         20.467508   
153                         20.889861   
154                         20.576656   
156                         38.597303   
157                         40.485076   
160                         22.611617   
168                         21.232270   
181                               NaN   
219                         29.571648   
220                          8.868571   
221                          8.866592   
222                          8.600337   
223                          8.342990   
224                          8.488490   
226                          8.502347   
241                         20.160605   

     Online interaction with public authorities  Internet access  \
20                                        17.68        71.324158   
21                                        22.13        74.007414   
22                                        25.93        75.417230   
23                                        18.54        71.497558   
24                                        36.30        84.331716   
25                                        25.16        70.040531   
73                                        94.35        99.265970   
138                                       74.04              NaN   
139                                       74.87              NaN   
140                                       78.02              NaN   
142                                       74.17              NaN   
144                                       47.40        91.214831   
145                                       45.05        97.123682   
149                                       73.02        87.876316   
151                                       66.93        90.464120   
152                                       64.66        86.238250   
153                                       63.69        91.535815   
154                                       69.45        93.522051   
156                                       91.00        88.381400   
157                                       90.89        94.219075   
160                                       35.53        95.279201   
168                                       27.10        94.900901   
181                                       78.20        98.421072   
219                                       43.79        90.920181   
220                                       12.80        87.431754   
221                                       16.88        82.718279   
222                                       12.05        63.628775   
223                                       12.94        67.054536   
224                                       10.46        62.334884   
226                                       13.64        74.298647   
241                                       54.14        82.243411   

     Freedom of media  Subjective health status  \
20          44.017563                 64.173979   
21          31.580732                 64.173979   
22          36.038126                 64.173979   
23          31.750105                 64.173979   
24          28.854629                 70.663741   
25          34.495191                 70.663741   
73          89.739823                 66.979339   
138               NaN                       NaN   
139               NaN                       NaN   
140               NaN                       NaN   
142               NaN                       NaN   
144         48.349965                 62.767847   
145         72.226331                 62.767847   
149         46.029355                 65.594238   
151         43.696804                 65.594238   
152         53.355036                 61.617039   
153         39.128852                 61.617039   
154         47.950009                 61.617039   
156         82.383342                 81.202145   
157         82.135800                 81.202145   
160         70.219241                 76.117435   
168         68.108595                 71.736228   
181         59.657466                 76.452805   
219         83.878809                 46.624557   
220         63.211857                 74.332222   
221         60.370669                 74.332222   
222         68.098965                 70.020375   
223         63.355079                 70.020375   
224         61.167339                 72.555964   
226         70.207752                 74.884357   
241         80.322566                 65.252924   

     Standardised cancer death rate  Standardised heart disease death rate  \
20                           111.71                                 177.01   
21                           107.45                                 161.38   
22                           103.87                                 146.30   
23                            93.60                                 179.74   
24                            86.49                                 158.98   
25                           100.09                                 157.09   
73                            70.27                                  25.54   
138                           56.47                                  25.77   
139                           66.75                                  30.84   
140                           56.24                                  46.99   
142                           67.70                                  68.42   
144                           99.29                                  54.76   
145                          105.62                                  71.97   
149                          140.73                                 106.28   
151                          143.70                                  98.91   
152                          147.60                                 129.75   
153                          142.37                                 118.43   
154                          137.49                                 115.55   
156                           66.51                                  33.29   
157                           65.78                                  29.42   
160                           64.99                                  22.22   
168                           65.88                                  32.01   
181                           59.83                                  25.77   
219                           88.70                                  46.17   
220                          116.66                                 123.41   
221                          110.47                                 107.58   
222                          118.95                                 111.04   
223                          123.76                                 113.24   
224                          118.27                                 118.95   
226                          106.41                                 116.69   
241                           95.31                                  85.59   

     Years of life lost due to air pollution  Index of positive emotions  \
20                               1595.331861                   69.662316   
21                               1403.075472                   60.985101   
22                               1400.292431                   59.174885   
23                               1258.053269                   56.982241   
24                               1435.164438                   69.834075   
25                               1550.893974                   62.956455   
73                                231.271641                   79.335212   
138                                      NaN                         NaN   
139                                      NaN                         NaN   
140                                      NaN                         NaN   
142                                      NaN                         NaN   
144                               625.269271                   69.483735   
145                              1215.616194                   69.834275   
149                               982.421809                   63.924973   
151                              1018.261308                   68.514176   
152                              1450.165143                   62.924879   
153                              1199.113915                   64.627739   
154                              1032.157366                   59.852810   
156                               132.990230                   77.708654   
157                               108.084469                   75.688084   
160                               387.721102                   67.905975   
168                               531.279547                   67.172942   
181                               140.060215                   70.450629   
219                                      NaN                   73.135644   
220                              1022.404058                   58.620701   
221                               934.829901                   65.335070   
222                              1153.484740                   68.277262   
223                               788.980873                   69.396496   
224                              1142.595567                   63.121105   
226                              1430.731088                   63.540169   
241                              1248.576541                   72.737662   

     Air pollution NO2  Air pollution Ozone (SOMO35)  Air pollution pm2.5  \
20           14.358467                   3034.710975            16.199397   
21           16.074244                   3101.807327            14.803620   
22           15.515333                   3242.947617            14.763625   
23           15.779245                   3112.482382            13.760083   
24           21.384778                   2691.976191            15.048761   
25           17.253786                   3371.289643            15.903341   
73            9.225796                   2542.431365             8.136573   
138                NaN                           NaN                  NaN   
139                NaN                           NaN                  NaN   
140                NaN                           NaN                  NaN   
142                NaN                           NaN                  NaN   
144          10.537707                   6110.991568            11.423375   
145          18.700000                   5134.266667            17.700000   
149          12.349501                   4566.467173            12.895554   
151          11.035812                   4623.226746            13.225339   
152          12.703328                   4200.720981            16.813405   
153          14.119673                   3973.778091            14.694741   
154          13.970837                   4321.338619            13.346143   
156           6.476355                   2401.300376             7.364581   
157          10.894495                   1576.778311             6.955132   
160          15.270986                   6607.424456             9.824047   
168          11.568476                   6339.201472            11.679124   
181          14.000000                   3486.166667             7.400000   
219                NaN                           NaN                  NaN   
220          17.183943                   3119.923300            13.547579   
221          17.491909                   2736.542365            12.781361   
222          16.701928                   2686.237008            14.648731   
223          17.708084                   3049.873969            11.505344   
224          17.101476                   3206.913901            14.568413   
226          15.526070                   3521.443159            17.119240   
241          11.888746                   3743.048134            17.288746   

     Bathing water quality  Trust in the national government  \
20                     NaN                         26.260028   
21                     NaN                         15.006224   
22                0.767442                          6.466323   
23                1.000000                         20.480822   
24                     NaN                         25.659458   
25                     NaN                         18.695278   
73                0.891156                         64.768757   
138               0.703125                               NaN   
139               0.645161                               NaN   
140               0.090909                               NaN   
142               0.285714                               NaN   
144               0.984496                         28.180843   
145               0.052632                         40.392784   
149               0.739130                         46.168330   
151               0.720000                         41.732612   
152               0.357143                         44.408648   
153               0.384615                         33.667501   
154               0.280000                         42.039402   
156               0.896552                         67.208976   
157               0.677419                         62.531003   
160               0.856448                         38.906744   
168               0.781609                         38.345288   
181               0.823529                         35.446173   
219               0.827586                         53.788383   
220                    NaN                         15.084238   
221                    NaN                         17.013996   
222                    NaN                         16.551924   
223               0.840000                         25.826363   
224                    NaN                         19.690609   
226                    NaN                         21.236413   
241               0.600000                         25.835636   

     Trust in the legal system  Trust in the police  \
20                   24.777231            63.754444   
21                   26.045265            62.817532   
22                   17.230189            50.691063   
23                   28.126740            72.332635   
24                    9.929840            46.140749   
25                   25.092784            69.510244   
73                   87.825540            84.103289   
138                        NaN                  NaN   
139                        NaN                  NaN   
140                        NaN                  NaN   
142                        NaN                  NaN   
144                  24.530729            80.213798   
145                  35.542155            74.470215   
149                  43.320446            77.247959   
151                  52.780813            78.328057   
152                  48.043374            64.555192   
153                  48.620709            68.640397   
154                  41.307942            73.266985   
156                  69.660096            77.392460   
157                  66.247598            83.366891   
160                  48.991338            77.473463   
168                  46.867920            77.486812   
181                  48.133744            70.158365   
219                  49.710758            78.591264   
220                  37.831775            60.391196   
221                  38.308302            64.139650   
222                  42.362445            59.209724   
223                  44.351449            71.066672   
224                  38.591906            62.392230   
226                  40.821090            71.391120   
241                  57.312993            81.962150   

     Voiced opinion to public official  \
20                           17.232252   
21                           15.617317   
22                           12.249868   
23                           23.930437   
24                           20.637907   
25                           12.206946   
73                           41.638605   
138                                NaN   
139                                NaN   
140                                NaN   
142                                NaN   
144                          16.658870   
145                          18.352706   
149                          20.782973   
151                          15.948091   
152                          33.549440   
153                          18.170044   
154                          30.945189   
156                          26.684969   
157                          25.694373   
160                          20.726316   
168                          16.155737   
181                          12.229954   
219                          36.653251   
220                          20.346496   
221                          17.183250   
222                          20.643253   
223                          28.289891   
224                          24.953060   
226                          23.679496   
241                          22.007959   

     Female participation in regional assemblies  Institution quality index  \
20                                     23.750000                     -2.703   
21                                     23.750000                     -1.392   
22                                     23.750000                     -1.406   
23                                     23.750000                     -1.712   
24                                     23.750000                     -2.160   
25                                     23.750000                     -2.694   
73                                     47.619048                      1.781   
138                                    48.780488                     -1.204   
139                                    45.098039                     -0.839   
140                                    40.000000                     -1.508   
142                                    50.000000                     -1.968   
144                                    29.927007                     -0.789   
145                                    39.583333                     -1.240   
149                                    16.981132                     -0.874   
151                                    10.416667                     -1.096   
152                                    11.864407                     -1.147   
153                                    14.705882                     -1.095   
154                                    14.754098                     -1.018   
156                                    27.802691                      1.208   
157                                    27.802691                      1.032   
160                                    19.354839                      0.041   
168                                    24.285714                     -2.116   
181                                    35.000000                      1.235   
219                                    29.787234                      0.429   
220                                    21.319797                     -1.041   
221                                    18.000000                     -1.238   
222                                    19.718310                     -1.620   
223                                    20.975610                     -1.729   
224                                    21.459227                     -1.755   
226                                    16.265060                     -1.302   
241                                    10.743802                     -0.811   

     Freedom over life choices  Job opportunities  Teenage pregnancy  \
20                   79.932867          16.523476          47.283763   
21                   64.996804          14.124578          28.625954   
22                   80.641157          19.366971          25.999151   
23                   73.586025          22.753249          63.776133   
24                   70.934462          30.302663          21.334576   
25                   61.620352          35.250503          42.887199   
73                   92.412159          75.818968           1.086778   
138                        NaN                NaN          11.798980   
139                        NaN                NaN          12.450852   
140                        NaN                NaN          69.213612   
142                        NaN                NaN          71.783842   
144                  62.843158          36.848069           3.040202   
145                  62.089196          58.329276           3.131277   
149                  79.881765          53.005955          13.340134   
151                  84.585888          37.816191          19.900971   
152                  71.567585          37.858879          43.128906   
153                  72.606966          37.829059          31.915989   
154                  71.640371          36.610524          13.689940   
156                  89.788114          61.826880           4.283756   
157                  87.391734          56.608397           5.052588   
160                  68.964543          32.359360           2.258866   
168                  75.993345          18.331588           7.728705   
181                  82.573643          44.551035           2.831613   
219                  85.051266          57.727387           4.492007   
220                  86.472665          48.446216          33.829745   
221                  81.479900          45.584064          46.593688   
222                  91.418181          21.737960          29.886090   
223                  79.014832          34.930716          37.524013   
224                  80.123854          31.940370          40.317092   
226                  82.534399          30.514953          34.315044   
241                  69.956341          11.860318          49.449936   

     Young people not in education, employment or training (NEET)  \
20                                           25.466667              
21                                           13.733333              
22                                           13.733333              
23                                           19.400000              
24                                            7.166667              
25                                           14.666667              
73                                            6.300000              
138                                          18.400000              
139                                          18.033333              
140                                          28.333333              
142                                          20.800000              
144                                          12.833333              
145                                           6.833333              
149                                           8.466667              
151                                          12.666667              
152                                          17.300000              
153                                          16.700000              
154                                           9.533333              
156                                           8.533333              
157                                           8.766667              
160                                          15.366667              
168                                          28.633333              
181                                           7.433333              
219                                          12.500000              
220                                          12.200000              
221                                          24.100000              
222                                          13.466667              
223                                          21.500000              
224                                          18.333333              
226                                          20.433333              
241                                          16.100000              

     Institutions corruption index (EQI)  \
20                                -2.563   
21                                -0.110   
22                                -1.049   
23                                -1.153   
24                                -1.806   
25                                -1.481   
73                                 1.513   
138                               -1.174   
139                               -0.565   
140                               -0.894   
142                               -1.511   
144                               -0.798   
145                               -1.189   
149                               -1.055   
151                               -1.209   
152                               -1.492   
153                               -1.587   
154                               -1.368   
156                                1.258   
157                                0.641   
160                                0.178   
168                               -1.320   
181                                1.199   
219                               -0.267   
220                               -1.033   
221                               -0.552   
222                               -1.390   
223                               -1.343   
224                               -1.600   
226                               -1.091   
241                               -1.114   

     Institution impartiality index (EQI)  Tolerance towards immigrants  \
20                                 -1.648                     26.633780   
21                                 -0.100                     28.055398   
22                                 -0.325                     24.849501   
23                                 -0.948                     34.070780   
24                                 -1.541                     34.270100   
25                                 -2.293                     49.345868   
73                                  1.157                     88.113235   
138                                -0.381                           NaN   
139                                -0.043                           NaN   
140                                -0.347                           NaN   
142                                 0.130                           NaN   
144                                -0.855                     39.516334   
145                                -1.424                     56.059397   
149                                -0.624                     40.264304   
151                                -1.071                     43.036625   
152                                -1.009                     36.170510   
153                                -1.497                     33.861097   
154                                -1.019                     36.533368   
156                                 0.893                     85.557875   
157                                 0.948                     85.420738   
160                                 0.014                     74.070436   
168                                -2.452                     75.650138   
181                                 1.296                     63.277318   
219                                 0.523                     91.311816   
220                                -0.841                     60.722968   
221                                -0.537                     58.441035   
222                                -1.085                     45.662320   
223                                -1.218                     48.234844   
224                                -0.743                     47.866458   
226                                -0.989                     50.827081   
241                                -1.146                     49.119725   

     Tolerance towards minorities  Tolerance towards  gay or lesbian people  \
20                      60.040456                                 21.227692   
21                      59.526223                                 13.980755   
22                      57.038981                                 15.248122   
23                      50.858231                                 11.725824   
24                      53.966053                                 23.253465   
25                      74.447695                                 39.809715   
73                      85.831922                                 90.188596   
138                           NaN                                       NaN   
139                           NaN                                       NaN   
140                           NaN                                       NaN   
142                           NaN                                       NaN   
144                     52.916913                                 36.052889   
145                     68.049388                                 55.800389   
149                     63.171081                                 47.773431   
151                     74.218147                                 47.364547   
152                     71.549399                                 30.769921   
153                     62.175433                                 33.227596   
154                     66.929844                                 33.853538   
156                     85.770663                                 78.829749   
157                     89.092996                                 77.075454   
160                     86.405279                                 73.783663   
168                     82.474538                                 70.190504   
181                     72.473454                                 42.838908   
219                     87.293176                                 71.365073   
220                     77.058538                                 21.246246   
221                     76.111439                                 21.735440   
222                     53.359268                                  9.609959   
223                     64.495078                                 14.518233   
224                     64.818275                                 23.839380   
226                     66.376230                                  9.167170   
241                     59.991146                                 30.219435   

     Women treated with respect  Tertiary education attainment  \
20                    71.211593                           18.7   
21                    65.648795                           26.4   
22                    47.779465                           28.1   
23                    65.425224                           21.7   
24                    64.714521                           43.5   
25                    73.040689                           22.9   
73                    80.332996                           53.1   
138                         NaN                           24.2   
139                         NaN                           29.3   
140                         NaN                           22.0   
142                         NaN                           24.8   
144                   59.638205                           24.9   
145                   55.304711                           43.8   
149                   69.291904                           22.7   
151                   67.394845                           22.0   
152                   61.264279                           19.9   
153                   56.124928                           20.6   
154                   70.282286                           22.8   
156                   86.626445                           50.2   
157                   81.755746                           56.9   
160                   59.384062                           22.3   
168                   47.124230                           15.2   
181                   80.405940                           52.3   
219                   59.646254                           22.3   
220                   46.496009                           18.2   
221                   45.435139                           18.9   
222                   34.534991                           14.0   
223                   45.236989                           14.0   
224                   36.360040                           13.7   
226                   36.397542                           17.6   
241                   81.474170                           28.2   

     Lifelong learning  Academic citations per 1000 persons   \
20                 1.0                              0.073352   
21                 1.7                              0.261241   
22                 0.9                              0.367248   
23                 1.9                              0.248473   
24                 2.4                              1.360754   
25                 1.4                              0.366007   
73                31.2                             21.182476   
138                6.0                              0.326001   
139                9.6                              0.196017   
140                6.6                              0.598639   
142                8.0                                   NaN   
144                4.5                                   NaN   
145                7.2                              1.493249   
149                7.0                              0.714710   
151               10.6                              1.214508   
152                8.5                              0.358730   
153                8.2                              1.380873   
154                8.3                              2.154984   
156               10.9                              4.972473   
157               12.6                              6.894654   
160               11.4                              5.340761   
168                6.3                              3.587440   
181               18.1                              5.992004   
219                9.0                              1.898200   
220                7.7                              1.719081   
221                2.5                              0.604914   
222                7.2                              0.996194   
223                6.9                              0.257528   
224                6.1                              0.105659   
226                3.0                              0.403420   
241               10.6                              1.376123   

     gdp_per_capita_2022  
20                8500.0  
21                8900.0  
22               10300.0  
23               11900.0  
24               20800.0  
25                9300.0  
73               90400.0  
138              25300.0  
139              27000.0  
140              15600.0  
142              11500.0  
144              16800.0  
145              31900.0  
149              16200.0  
151              12000.0  
152              11400.0  
153              11400.0  
154              12700.0  
156             120300.0  
157             104100.0  
160              35700.0  
168              20100.0  
181             118700.0  
219              23700.0  
220              13800.0  
221              14100.0  
222               9100.0  
223              11900.0  
224              11300.0  
226              11300.0  
241              14600.0  

Outliers per variable based on the IQR

[Figure 00.5]

In [19]:
# Names of the numeric indicator columns
num_cols = df.select_dtypes(include=['float64', 'int64']).columns

# Collect one row per variable and build the summary in a single step
# (avoids calling pd.concat on an empty DataFrame inside the loop)
rows = []

for col in num_cols:
    # Tukey fences: values beyond 1.5 * IQR from the quartiles count as outliers
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]

    num_outliers = outliers.shape[0]
    perc_outliers = round((num_outliers / df.shape[0]) * 100, 2)

    rows.append({'Variable': col, 'Outliers': num_outliers, '% Outliers': perc_outliers})

outliers_summary = pd.DataFrame(rows, columns=['Variable', 'Outliers', '% Outliers'])

# Keep only the variables with at least one outlier
outliers_summary = outliers_summary[outliers_summary['Outliers'] > 0]
print("Figure 00.5")
print()
print(outliers_summary)
Figure 00.5

                                             Variable Outliers  % Outliers
0                                    Infant mortality       15        6.20
1                     Satisfaction with water quality        3        1.24
2                                  Uncollected sewage       59       24.38
3                        Sewage treatment, additional       21        8.68
4                                     Safety at night        5        2.07
7                                      Traffic deaths       11        4.55
8   Share of low-achieving 15 year olds in reading...       60       24.79
9   Share of low-achieving 15 year olds in maths a...       27       11.16
10                    Lower-secondary completion only        9        3.72
11                               Early school leavers        6        2.48
12                                  Broadband at home        5        2.07
13                   Digital skills above basic level       10        4.13
15                                    Internet access       20        8.26
16                                   Freedom of media        6        2.48
17                           Subjective health status        8        3.31
18                     Standardised cancer death rate       12        4.96
19              Standardised heart disease death rate       23        9.50
20            Years of life lost due to air pollution        4        1.65
21                         Index of positive emotions        2        0.83
22                                  Air pollution NO2        8        3.31
23                       Air pollution Ozone (SOMO35)        7        2.89
24                                Air pollution pm2.5        4        1.65
25                              Bathing water quality       17        7.02
28                                Trust in the police       24        9.92
29                  Voiced opinion to public official        1        0.41
32                          Freedom over life choices        8        3.31
34                                  Teenage pregnancy       25       10.33
35  Young people not in education, employment or t...       10        4.13
38                       Tolerance towards immigrants       13        5.37
39                       Tolerance towards minorities        2        0.83
40           Tolerance towards  gay or lesbian people        2        0.83
41                         Women treated with respect        3        1.24
43                                  Lifelong learning       19        7.85
44               Academic citations per 1000 persons         7        2.89
45                                gdp_per_capita_2022        4        1.65
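
As a quick sanity check of the IQR rule used above, the sketch below applies the same fences to a small made-up series. The helper name `iqr_outlier_mask` and the sample values are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside the Tukey fences.

    Hypothetical helper for illustration; NaN values are never flagged.
    """
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    return (series < lower_bound) | (series > upper_bound)


# Made-up values: nine moderate observations and one extreme one
toy = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 90], name="toy_indicator")
mask = iqr_outlier_mask(toy)

print(toy[mask].tolist())
print(f"{mask.mean() * 100:.2f}% of the values are flagged")
```

Only the extreme value falls outside the fences here. On skewed indicators the same rule can flag a large share of regions, so a high percentage does not necessarily point to data errors.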

Outliers by country

[Figure 00.6]

In [20]:
country_column = 'Country'
countries = df[country_column].unique().tolist()

# One row per variable, one column per country
rows = []

for col in num_cols:
    outliers_data = {'Variable': col}

    for country in countries:
        df_country = df[df[country_column] == country]

        # IQR fences computed within each country, not across all regions
        Q1 = df_country[col].quantile(0.25)
        Q3 = df_country[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR

        outliers = df_country[(df_country[col] < lower_bound) | (df_country[col] > upper_bound)]

        outliers_data[country] = outliers.shape[0]

    rows.append(outliers_data)

outliers_summary = pd.DataFrame(rows, columns=['Variable'] + countries)

# Shade the counts in red; the 'Variable' label column is left out of the gradient
styled_outliers_summary = outliers_summary.style.background_gradient(cmap='Reds', subset=pd.IndexSlice[:, outliers_summary.columns != 'Variable'])
print("Figure 00.6")
print()
styled_outliers_summary
Figure 00.6

Out[20]:
  Variable AT BE BG CY CZ DE DK EE EL ES FI FR HR HU IE IT LT LU LV MT NL PL PT RO SE SI SK
0 Infant mortality 0 1 1 0 0 1 0 0 1 4 2 5 1 0 0 3 0 0 0 0 1 0 0 1 0 0 1
1 Satisfaction with water quality 0 1 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0
2 Uncollected sewage 0 1 0 0 0 0 0 0 0 4 0 1 0 0 0 5 0 0 0 0 0 0 1 0 0 0 0
3 Sewage treatment, additional 0 1 0 0 0 2 0 0 0 0 0 3 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0
4 Safety at night 0 0 1 0 1 0 0 0 0 0 0 5 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1
5 Money Stolen 0 0 0 0 1 0 0 0 3 0 1 7 0 0 0 5 0 0 0 0 0 0 0 0 2 0 0
6 Assaulted/Mugged 0 0 0 0 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1
7 Traffic deaths 0 2 0 0 0 0 1 0 1 1 0 2 0 1 0 2 0 0 0 0 0 1 0 0 1 0 0
8 Share of low-achieving 15 year olds in reading (level 1a or lower) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9 Share of low-achieving 15 year olds in maths and science 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10 Lower-secondary completion only 1 0 1 0 2 0 0 0 0 0 1 5 0 0 0 0 0 0 0 0 2 2 0 1 0 0 0
11 Early school leavers 1 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0
12 Broadband at home 0 0 1 0 1 3 0 0 1 0 0 4 1 0 0 0 0 0 0 0 2 0 0 0 1 0 0
13 Digital skills above basic level 2 0 1 0 1 3 0 0 1 0 0 2 1 0 0 0 0 0 0 0 2 0 0 0 1 0 0
14 Online interaction with public authorities 1 0 1 0 2 0 0 0 1 1 0 1 1 1 0 0 0 0 0 0 0 0 1 1 0 0 1
15 Internet access 0 0 1 0 1 5 2 0 0 4 0 2 0 0 0 4 0 0 0 0 3 0 0 0 2 0 1
16 Freedom of media 0 0 1 0 1 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0
17 Subjective health status 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
18 Standardised cancer death rate 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 2 0 0
19 Standardised heart disease death rate 0 0 0 0 1 3 1 0 4 0 0 2 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0
20 Years of life lost due to air pollution 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0
21 Index of positive emotions 2 0 0 0 1 15 1 0 1 0 0 0 1 0 0 5 0 0 0 0 0 0 1 0 0 0 0
22 Air pollution NO2 1 2 1 0 1 0 0 0 1 2 1 1 1 1 0 0 0 0 0 0 0 2 0 2 0 0 0
23 Air pollution Ozone (SOMO35) 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0
24 Air pollution pm2.5 0 0 0 0 1 1 1 0 1 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0
25 Bathing water quality 0 0 0 0 0 2 0 0 0 0 0 3 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
26 Trust in the national government 2 1 0 0 1 1 0 0 0 0 0 0 0 0 0 9 0 0 0 0 0 0 0 0 0 0 0
27 Trust in the legal system 0 1 0 0 1 0 0 0 1 5 0 0 1 0 0 0 0 0 0 0 0 7 0 0 2 0 1
28 Trust in the police 2 0 0 0 0 0 0 0 4 0 0 0 0 1 0 0 0 0 0 0 3 0 0 0 2 0 0
29 Voiced opinion to public official 0 0 0 0 0 3 0 0 0 3 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
30 Female participation in regional assemblies 0 0 0 0 0 1 0 0 1 0 1 2 1 0 0 0 0 0 0 0 1 0 1 0 0 0 1
31 Institution quality index 0 0 0 0 2 1 2 0 0 0 0 4 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0
32 Freedom over life choices 2 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 3 2 0 0 0 0 0
33 Job opportunities 2 0 0 0 2 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
34 Teenage pregnancy 1 0 0 0 1 0 0 0 0 0 0 5 0 1 0 2 0 0 0 0 0 0 0 0 0 0 0
35 Young people not in education, employment or training (NEET) 1 0 1 0 1 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
36 Institutions corruption index (EQI) 0 1 0 0 1 1 2 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
37 Institution impartiality index (EQI) 0 1 0 0 0 2 0 0 0 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
38 Tolerance towards immigrants 0 0 1 0 1 0 2 0 3 0 1 2 1 0 0 0 0 0 0 0 0 0 2 0 0 0 0
39 Tolerance towards minorities 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0
40 Tolerance towards gay or lesbian people 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
41 Women treated with respect 0 0 1 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0
42 Tertiary education attainment 1 0 1 0 2 1 1 0 2 0 1 1 0 2 0 0 0 0 0 0 1 1 1 1 0 0 1
43 Lifelong learning 1 1 0 0 1 0 1 0 1 0 2 3 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1
44 Academic citations per 1000 persons 0 0 2 0 1 0 1 0 0 0 2 1 0 1 0 0 0 0 0 0 0 1 0 1 0 0 1
45 gdp_per_capita_2022 0 0 1 0 1 2 1 0 1 0 1 3 1 1 0 0 0 0 0 0 0 1 0 1 1 0 1
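
The per-country count can also be computed without the nested loop. The following is a minimal sketch on made-up data, assuming the same 1.5 × IQR rule; the toy frame, its `value` column and the helper `count_iqr_outliers` are hypothetical. It also illustrates a side effect of the rule: when every region of a country shares the same value, the IQR is zero and no region can be flagged, which is one possible explanation for all-zero rows in a table like the one above.

```python
import pandas as pd


def count_iqr_outliers(values: pd.Series, k: float = 1.5) -> int:
    """Count values outside the Tukey fences of `values` (hypothetical helper)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return int(outside.sum())


# Made-up regions: country "AA" has one extreme region, "BB" has a constant value
toy = pd.DataFrame({
    "Country": ["AA"] * 6 + ["BB"] * 4,
    "value": [5.0, 6.0, 5.5, 6.5, 6.0, 40.0, 3.0, 3.0, 3.0, 3.0],
})

# Fences are computed separately inside each country group
per_country = toy.groupby("Country")["value"].agg(count_iqr_outliers)
print(per_country)
```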

[Figure 00.7]

In [21]:
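# Column names of the numeric indicators
num_cols = df.select_dtypes(include=['float64', 'int64']).columns

country_col = 'Country'

print("Figure 00.7")

for col in num_cols:
    plt.figure(figsize=(12, 6))

    # Distribution of the indicator within each country
    sns.boxplot(x=country_col, y=col, data=df, color='#9ecae1')

    # Overall mean across all regions, drawn as a reference line
    mean = df[col].mean()
    plt.axhline(mean, color='red', linestyle='--', linewidth=1, label=f'Mean ({mean:.2f})')

    plt.title(col)
    plt.legend()  # display the mean label defined above
    plt.grid(True)

    plt.show()
Figure 00.7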
[46 box plots, one per numeric indicator, grouped by country; a dashed red line marks the overall mean]
In [22]:
print(outliers_df.Country.nunique())
outliers_df.RegionName.nunique()
11
Out[22]:
31

Using the Z-score, outliers appear in a total of 31 regions across 11 countries. It is reasonable to expect each country to have a few regions whose values lie far from the mean, for geographic or social reasons.

The map shows that most outliers are concentrated in very specific areas:

  • Eastern Europe: mainly Bulgaria, Romania and adjacent regions.
  • Ireland
  • Overseas territories of France and Portugal

Apart from these three areas, the remaining outliers are widely scattered, with no apparent relationship between them, at least geographically.
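
The Z-score criterion mentioned above can be sketched as follows. This is a minimal illustration on made-up data, assuming the common |z| > 3 cut-off; the threshold actually used to build `outliers_df`, as well as the toy column and region names, are assumptions here.

```python
import pandas as pd

# Made-up indicator: 19 ordinary regions plus one extreme region
toy = pd.DataFrame({
    "RegionName": [f"Region {i}" for i in range(20)],
    "indicator": [50.0 + (i % 5) for i in range(19)] + [120.0],
})

# Standardise: distance from the mean in units of standard deviation
mean = toy["indicator"].mean()
std = toy["indicator"].std()  # sample standard deviation (ddof=1)
toy["z_score"] = (toy["indicator"] - mean) / std

# Assumed cut-off: flag regions more than 3 standard deviations from the mean
threshold = 3
toy_outliers = toy[toy["z_score"].abs() > threshold]
print(toy_outliers[["RegionName", "z_score"]])
```

Unlike the IQR fences, the mean and standard deviation are themselves pulled by extreme values, so the two criteria can flag different regions.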

In [23]:
# Attach geometries to the outlier regions; how='right' keeps every row of outliers_df
merged_gdf = pd.merge(gdf, outliers_df, how='right', left_on='NUTS_ID', right_on='NUTS code')

[Figure 00.8]

In [24]:
print("Figure 00.8")
fig, ax = plt.subplots(figsize=(12, 10))

ax.set_facecolor('lightgrey')

gdf.plot(ax=ax, color='#e5f5f9', edgecolor='gray', linewidth=0.5)  # Base map
merged_gdf.plot(ax=ax, color='red', markersize=50, label='Outliers')  # Outliers in red
ax.set_title('Outliers')

plt.show()
Figure 00.8
[Map: base map of the regions, with the outlier regions highlighted in red]
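
Because the merge above uses `how='right'`, any outlier row whose `NUTS code` has no match in the boundary file would be kept without a geometry, and so would not be drawn on the map. One hedged way to check for this is the `indicator=True` option of `pd.merge`, sketched here with plain DataFrames and made-up codes standing in for `gdf` and `outliers_df`.

```python
import pandas as pd

# Stand-ins for the notebook's objects (made-up codes, no real geometries)
boundaries = pd.DataFrame({
    "NUTS_ID": ["XX11", "XX12", "XX13"],
    "geometry": ["poly_a", "poly_b", "poly_c"],  # placeholder strings
})
flagged = pd.DataFrame({
    "NUTS code": ["XX12", "XX99"],  # "XX99" has no boundary on purpose
    "RegionName": ["Region B", "Region Z"],
})

# indicator=True adds a `_merge` column telling where each row came from
merged = pd.merge(
    boundaries, flagged,
    how="right", left_on="NUTS_ID", right_on="NUTS code",
    indicator=True,
)

# Rows present only in the right table found no matching boundary
unmatched = merged.loc[merged["_merge"] == "right_only", "NUTS code"].tolist()
print(unmatched)
```

An empty list would mean that every flagged region has a boundary to draw.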

Distribution of the indicators

[Figure 00.9]

In [25]:
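print("Figure 00.9")

num_cols = df.select_dtypes(include=['number']).columns

fig, axs = plt.subplots(ncols=3, nrows=16, figsize=(20, 60))
axs = axs.flatten()

# sns.distplot is deprecated; histplot with a KDE overlay is the closest replacement
for ax, col in zip(axs, num_cols):
    sns.histplot(df[col], bins=20, kde=True, stat='density', ax=ax)

# The 3 x 16 grid may have more panels than columns; hide the unused ones
for ax in axs[len(num_cols):]:
    ax.set_visible(False)

plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=5.0)
Figure 00.9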
[Grid of histograms with density curves, one panel per numeric indicator]

Correlation between indicators

[Figure 00.10]

In [26]:
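print("Figure 00.10")

# Numeric indicators only
num_col = df.select_dtypes(include='number')

# Pairwise Pearson correlation between all numeric indicators
correlation_matrix = num_col.corr()

plt.figure(figsize=(12, 12))
# Fix the colour scale to [-1, 1] so the midpoint of the palette means zero correlation
sns.heatmap(correlation_matrix, annot=False, cmap='coolwarm', vmin=-1, vmax=1)
plt.show()
Figure 00.10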
[Heatmap of pairwise correlations between the numeric indicators]
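
With `annot=False` the heatmap shows patterns but not values. One possible complement, sketched below on made-up data, lists the most strongly correlated pairs by keeping only the upper triangle of the correlation matrix. The column names and values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up indicators: `b` moves with `a`, `c` moves against it, `d` is unrelated
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [2.1, 3.9, 6.2, 8.1, 9.8, 12.1],
    "c": [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],
    "d": [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],
})

corr = toy.corr()  # Pearson correlation by default

# Keep each pair once: mask everything except the strict upper triangle
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Sort by absolute strength, strongest first
top_pairs = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(top_pairs.head(3))
```

Applied to `correlation_matrix`, the same steps would give a ranked list to read alongside the heatmap.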

Save the dataset

In [27]:
df.to_csv('00_lectura_datos.csv', index=False)